Skip to content

Conversation

@murali-db
Copy link
Collaborator

@murali-db murali-db commented Sep 24, 2025

What changes are proposed in this pull request?

Implements a schema diffing framework for Delta Kernel Rust to enable schema evolution. The diff algorithm uses field IDs (from column mapping) to track fields across schema versions, enabling detection of renames, additions, removals, and modifications at all nesting levels.
Java version here.
Summary of the allowed and blocked changes is TODO.

  • Field ID-based matching: Identifies fields by column mapping ID, not name (enables rename detection)
  • Nested field support: Handles changes in structs, arrays, and maps at any depth
  • LCA filtering: Reports only least common ancestors to reduce noise (e.g., reports "user" added, not "user.name", "user.email", etc.)
  • Breaking change detection: Distinguishes safe changes (nullability loosened) from breaking ones (nullability tightened, type changed)
  • Container nullability tracking: Detects array/map nullability changes separately from element changes

How was this change tested?

Unit tests.

@murali-db murali-db force-pushed the murali-db/schema-evol branch from 654aa32 to 3af7c20 Compare September 24, 2025 11:00
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Sep 24, 2025
@codecov
Copy link

codecov bot commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 98.19639% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.80%. Comparing base (87e3552) to head (f6226f5).

Files with missing lines Patch % Lines
kernel/src/schema/diff.rs 98.19% 23 Missing and 13 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1346      +/-   ##
==========================================
+ Coverage   84.96%   85.80%   +0.84%     
==========================================
  Files         114      115       +1     
  Lines       29445    31441    +1996     
  Branches    29445    31441    +1996     
==========================================
+ Hits        25017    26977    +1960     
- Misses       3220     3243      +23     
- Partials     1208     1221      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@murali-db murali-db force-pushed the murali-db/schema-evol branch from 108283f to ed148a5 Compare October 1, 2025 11:18
@murali-db murali-db changed the title feat: schema diffing poc feat: schema diffing Oct 1, 2025
}

#[test]
fn test_cursed_deeply_nested_complex_types() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cursed

LMAO

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add cursed cases with double nesting. Ex: array<array<>> map<array<>, array<>>, and map<, map<,>>

with structs ^

Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small cleanup. Looks correct to me. Thank you!

murali-db and others added 8 commits October 16, 2025 00:30
Replace string-based field paths with ColumnName type throughout the
schema diffing framework for better consistency with the codebase.

Changes:
- Updated FieldChange, FieldUpdate, and FieldWithPath to use ColumnName
- Updated SchemaDiffError to use ColumnName in error messages
- Refactored path construction to use ColumnName::new() and join()
- Replaced string prefix matching with proper ColumnName comparison
- Updated all test assertions to use ColumnName

All tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Changed unwrap() to expect() with a descriptive message to satisfy
clippy linter requirements while maintaining safety guarantees.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add #![allow(dead_code)] to the schema diff module since this API
is not yet consumed by other parts of the codebase. The functions
are fully tested and ready for use in schema evolution features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
murali-db and others added 28 commits October 16, 2025 00:30
- Combine duplicate test_ancestor_filtering tests into single test with two cases (add and remove)
- Update test_ancestor_filtering_with_mixed_changes to verify both ancestor and descendant changes are included
- Expand test_array_element_struct_field_changes to verify add, remove, and update operations
- Expand test_map_value_struct_field_changes to verify add, remove, and update operations
- Add test_cursed_deeply_nested_complex_types for frightening nested container scenarios

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Verify that array<array<int>> -> array<array<double>> is detected
as TypeChanged on the parent field. This demonstrates the recursion
limitation where we cannot descend into non-struct container types
since there are no field IDs at those intermediate levels.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Renamed the function to better reflect that it classifies changes
between DataType instances, making the purpose more explicit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…maps

This change improves the schema diff algorithm to correctly handle nested
type changes in arrays and maps by using recursion in classify_data_type_change.

Key improvements:
- Recursively check element types in arrays (e.g., array<array<int>> -> array<array<double>>)
- Recursively check key/value types in maps
- Support combined changes (type + nullability) using FieldChangeType::Multiple
- Only recurse when types actually differ to avoid false positives

This fixes the issue where array<array<struct<a int>>> -> array<array<struct<a int, b int>>>
would be incorrectly blocked. Now nested struct changes are properly detected via field IDs
while also allowing detection of primitive type changes at any nesting level.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Created check_container_nullability_change() helper to reduce code duplication
- Applied DRY principle for nullability checking logic used in arrays and maps
- Added recursive type change detection for nested containers
- Support combined changes (type + nullability) via FieldChangeType::Multiple
- All 28 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Clarified comments for map key/value type change detection
- Existing code already handles recursion for arrays/maps containing structs
- Example: map<array<struct<a int>>, int> -> map<array<struct<a int, b int>>, int>
  is correctly detected via recursive calls to classify_data_type_change
- Map changes already combined via FieldChangeType::Multiple for:
  * Key type changes
  * Value type changes
  * Container nullability changes
- All 28 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Previously, field collection only handled one level of container
nesting (e.g., array<struct<...>>). This refactoring extracts the
container recursion logic into a new helper function that supports
arbitrary nesting depth.

The new `collect_fields_from_datatype` function recursively descends
through container layers (arrays and maps) to find all nested struct
fields, enabling support for deeply nested schemas like:
- array<array<struct<...>>>
- map<array<struct<...>>, array<struct<...>>>
- map<array<array<struct<a int>>>, array<struct<b int>>>

This uses mutual recursion between DataType processing and StructType
processing to ensure all nested structs are discovered regardless of
container nesting depth.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Metadata changes can be unsafe and breaking. For example, changes to row
tracking metadata are not backward compatible. To remain on the safe side,
we now classify all metadata changes as breaking changes.

This updates the `is_breaking_change_type` function to include
`FieldChangeType::MetadataChanged` in the set of breaking change types,
and updates the test expectations accordingly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
When a diff is empty, it should never have breaking changes. This adds
an assertion to verify this invariant in the test_identical_schemas test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Enhanced test robustness by adding assertions for all three field
categories (added, removed, updated) instead of just checking one.
This ensures tests catch unexpected changes in any category.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ported

Removed redundant assert_eq!(diff.added_fields.len(), 1) that was
checking the same condition twice in the test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Added assert!(diff.has_breaking_changes()) after verifying the
nested field path to ensure the breaking changes flag remains
correctly set throughout the test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Verify that rename operations (which are non-breaking) correctly report
no breaking changes in the array struct element test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Verify that adding a non-nullable struct correctly reports as a breaking change
in the ancestor filtering test with mixed changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add comprehensive test cases for nested array nullability changes:
- test_nested_array_nullability_loosened: outer array element nullability loosened
- test_nested_array_nullability_tightened: outer array element nullability tightened
- test_nested_array_inner_nullability_loosened: inner array element nullability loosened
- test_nested_array_inner_nullability_tightened: inner array element nullability tightened

These tests verify that nested array nullability changes are correctly detected
and classified as breaking or non-breaking changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Added three complex test cases to test_cursed_deeply_nested_complex_types:
- Case 3: array<array<struct>> - doubly nested array with struct elements
- Case 4: map<array<struct>, array<struct>> - map with arrays of structs
- Case 5: map<struct, map<struct, struct>> - map with nested map and structs

These tests verify field changes are correctly detected through extremely
complex nested structures with proper path construction and breaking change
detection.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Updated schema diffing logic to classify field removals as safe operations
rather than breaking changes. This aligns with the principle that dropping
columns is safe - existing data files remain valid and queries referencing
removed fields will fail at query time, but data integrity is maintained.

Changes:
- Modified compute_has_breaking_changes() to not flag field removals
- Updated 5 test assertions that previously expected removals to be breaking
- All 32 schema diff tests pass

Rationale: Field removals should not break existing readers since they can
safely ignore unknown fields in data files. Query-time failures for removed
fields are acceptable as they reflect the intentional schema change.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Performance optimizations:
- Add ColumnName::parent() method for O(1) parent path access
- Remove ColumnName::is_descendant_of() (no longer needed)
- Optimize has_added_ancestor() from O(N*M) to O(M) using parent chain
- Add early return in classify_data_type_change() for identical types

Code quality improvements:
- Add check_field_nullability_change() helper for better code reuse
- Refactor physical name validation into validate_physical_name()
- Simplify nested match statements with helper functions

Test coverage:
- Add Case 6: Field nullability tightening in deeply nested structures
- Add Case 7: Container nullability tightening in deeply nested structures

All 32 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@murali-db murali-db force-pushed the murali-db/schema-evol branch from 2a065a3 to f6226f5 Compare October 16, 2025 00:31
/// The path to this field (e.g., ColumnName::new(["user", "address", "street"]))
pub path: ColumnName,
/// The type of change that occurred
pub change_type: FieldChangeType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we make this changes: Vec<FieldChangeType> and avoid the extra indirection through the FieldChangeType::Multiple variant?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants